Holimap: an accurate and efficient method for solving stochastic gene network dynamics

Gene-gene interactions are crucial to the control of sub-cellular processes but our understanding of their stochastic dynamics is hindered by the lack of simulation methods that can accurately and efficiently predict how the distributions of gene product numbers vary across parameter space. To overcome these difficulties, here we present Holimap (high-order linear-mapping approximation), an approach that approximates the protein or mRNA number distributions of a complex gene regulatory network by the distributions of a much simpler reaction system. We demonstrate Holimap’s computational advantages over conventional methods by applying it to predict the stochastic time-dependent dynamics of various gene networks, including transcriptional networks ranging from simple autoregulatory loops to complex randomly connected networks, post-transcriptional networks, and post-translational networks. Holimap is ideally suited to study how the intricate network of gene-gene interactions results in precise coordination and control of gene expression.


Introduction
Genetic regulation occurs through intricate interactions between a number of genes [1][2][3][4].A gene "X" may express a protein which acts as a transcription factor (TF), promoting or inhibiting the RNA polymerase assembly on another target gene "Y" (or on itself) and thus regulating the extent that the latter is expressed [5].These gene-gene interactions can be simply visualised as a directed graph with the genes being the nodes (vertices) and the directed edges (links) representing the interactions [6,7].Networks inferred from gene expression data, commonly called gene regulatory networks [8], have been reconstructed by several methods [9][10][11][12][13].The complex connectivity of these networks makes intuitive understanding of their dynamics challenging.Consequently, the construction, mathematical analysis, and simulation of models of gene regulatory networks are indispensable tools in a quantitative biologist's arsenal.
Several formalisms have been employed to predict gene regulatory network dynamics, including Boolean networks, ordinary differential equations (ODEs), and chemical master equations (CMEs) -for reviews covering these approaches and more see Ref. [14,15].These approaches have various advantages and disadvantages.
In Boolean networks, the expression of each gene is tracked by a binary variable and hence large networks can be studied in a computationally efficiently way.A more refined description is provided by the use of ODEs, where the time-dependent concentrations of RNAs, proteins, and other molecules are predicted as a function of the rate constants of the reactions in the network [16,17].An even more realistic description makes use of the CME approach where one predicts not only the mean expression levels of various genes, but also the distributions of the discrete numbers of mRNAs and/or proteins measured across a population of cells [18].This stochasticity has various sources (biological intrinsic and extrinsic noise, and technical noise introduced by experimental protocols), all of which lead to the large differences in gene expression observed from one cell to another [19][20][21].
Unfortunately, with an increasing level of sophistication and predictive power, simulations also rapidly become computationally expensive.Unravelling the stochastic dynamics of gene networks requires solving a set of coupled CMEs for the probability of the system being in each possible state.Since the number of states of a gene network is typically infinite, direct solution of these equations is impossible; while methods which truncate the state space to a finite one exist, e.g. the finite-state projection (FSP) [22], the immense number of states limits their application to very small networks with one or two interacting genes.For larger networks, Monte Carlo simulations based on the stochastic simulation algorithm (SSA) [23] become more practical, but the issue remains that a large sampling size is typically required to obtain smooth distributions and hence the computational time can still be very considerable.
In this paper, we overcome the difficulties of conventional stochastic simulation methods for gene networks by devising a new efficient approach -the high-order linear-mapping approximation (Holimap).The basic idea is to map the dynamics of a gene network with second or higher-order interactions between TFs and genes (a system with nonlinear propensities and hence a nonlinear network) to the dynamics of a system where all reactions are first-order (a linear network).The reaction rates of this system are generally time-dependent and complex functions of the reaction rates of the original gene network and they are found by conditional moment-matching.The linear network has a much smaller state space than the nonlinear network which means that now simulation using FSP becomes feasible, leading to smooth distributions of protein numbers in a fraction of the time taken by SSA simulations.For an illustration of Holimap see Fig. 1.Illustration of Holimap and its advantages over the SSA.Holimap decouples gene-gene interactions in a nonlinear regulatory network and transforms it into a linear network with multiple effective parameters, some of which may be time-dependent.The time evolution of protein number distributions (for all genes) of the nonlinear network can be approximately predicted by solving the dynamics of the effective linear network using, e.g., FSP.Compared with the conventional Monte Carlo approach (the SSA), Holimap not only significantly reduces the CPU time, but it also yields an accurate, noise-free prediction of the protein number distributions.

Fundamental principles of Holimap illustrated by an autoregulation example
Consider a simple autoregulatory feedback loop [24,25], whereby protein expressed from a gene regulates its own transcription (Fig. 2(a)).Feedback is mediated by cooperative binding of h protein copies to the gene [26][27][28][29].In agreement with experiments [30], protein synthesis is assumed to occur in bursts of random size k sampled from a geometric distribution with parameter p, i.e.
Here σ b is the binding rate of protein to the gene; σ u is the unbinding rate; ρ b and ρ u are the burst frequencies of protein, i.e. the frequencies with which bursts are produced, when gene is in the bound and unbound states, respectively; d is the rate of protein degradation and dilution (due to cell division).The reaction system describes a positive feedback loop when ρ b > ρ u (since in the case, binding of a protein increases its own expression) and describes a negative feedback loop when ρ b < ρ u (binding of a protein decreases its own expression).
Let p i,n denote the probability of having n protein copies in an individual cell when the gene is in state i with i = 0, 1 corresponding to the unbound and bound states, respectively.To proceed, let g i = ∞ n=0 p i,n be the probability of observing the gene in state i and let µ m,i = ∞ n=0 n(n−1) • • • (n−m+1)p i,n be the mth factorial moment of protein numbers when the gene is in this state.For simplicity, we first focus on the case of noncooperative binding (h = 1).From the CME, it is straightforward to obtain the following time evolution equations for the moments: where g 1 = 1 − g 0 and B = k = p/(1 − p) is the mean protein burst size, i.e. the mean number of protein molecules produced in a single burst.For clarity, we have suppressed the explicit time-dependence of all moments.Note that this system of equations is not closed, i.e. the equation for a moment of a certain order depends on moments of higher orders, and hence an exact solution is generally impossible.This difficulty stems from the nonlinear dependence on molecule numbers of the bimolecular propensity modelling protein-gene interactions [31].
In contrast, a linear gene network (one composed of only first-order reactions, i.e. the propensity of each reaction has a linear dependence on molecule numbers) is much easier to solve both analytically and numerically than a gene network with nonlinear propensities; for example, the moment equations are closed and thus can be solved exactly in this case.A basic idea of the linear-mapping approximation (LMA) developed in Ref. [32] is to transform a complex nonlinear network into a linear one by replacing all second or higher-order reactions between proteins and genes by effective first-order reactions.Specifically, for the network in Fig. 2 Here the HD for the LMA represents the Hellinger distance between the real steady-state protein distribution computed using FSP applied to the nonlinear system and the approximate protein distribution computed using the LMA.(f) Heat plots of the HDs for the LMA and Holimaps as functions of the unbinding rate σu and binding rate σ b when ρ b ρu.The red curves enclose the true bimodal region, i.e. the parameter region in which the protein number has a bimodal distribution, as predicted by FSP.The orange curves enclose the bimodal region predicted by the approximation method.The vertical white dashed line demarcates the region of σu ≥ d, where the linear network given by the LMA can never exhibit bimodality, from the region of σu < d where it can.(g) Comparison of the steady-state protein distributions computed using FSP, LMA, 2-HM, and 4-HM in different regimes of gene state switching.(h) The maximum HD as a function of the cooperativity h for the LMA and Holimaps.Here the maximum HD is computed when σu and σ b vary over large ranges, while other parameters remain fixed.See Supplementary Note 1 for the technical details of this figure .one shown in Fig. 2(b), where the binding rate σ b for the former is replaced by the effective gene switching rate σb for the latter, while the other parameters remain unchanged.In the LMA, σb is chosen to be σ b multiplied by the conditional mean of protein numbers in the unbound gene state, i.e.
where g 0 and µ 1,0 can be calculated by a natural moment-closure method (Methods) [32].There are two approximations involved in the LMA: (i) in reality, the effective parameter σb should be stochastic rather than deterministic since it is proportional to the instantaneous protein number in the unbound state; (ii) any moment-closure method inevitably leads to some errors [33].
Next we propose a new method -Holimap, which we will show to perform much better than the LMA.There are two types of Holimaps.The first type is the 2-parameter Holimap (2-HM) which transforms the nonlinear gene network into the linear one illustrated in Fig. 2(c), where both the binding and unbinding rates σ b and σ u for the former are replaced by the effective gene switching rates σb and σu for the latter.The remaining question is how to determine σb and σu so that the solution of the linear network accurately approximates that of the nonlinear one.For the linear network, the evolution of moments is governed by The effective rates σb and σu are chosen so that the two systems have the same zeroth and first-order moment equations (for the latter, we mean the first-order moment when the gene is in the bound state).Matching the first and third identities in Eqs. ( 1) and ( 3), we find that σb and σu should satisfy The remaining question is how to use these equations to obtain formulae for the effective rates.This can be done as follows: we first solve for σb and σu using Eq. ( 4) and then substitute these into Eq.( 3) to obtain a set of closed moment equations.These equations can be solved for the values of all zeroth, first, and second-order moments, i.e. g i , µ 1,i , and µ 2,i .Finally substituting these into Eq.( 4) gives the values of the effective parameters for the linear network, i.e. σb and σu .
In steady state, the values of σb and σu are constants independent of time, and hence we can use the steady-state protein distribution of the linear network to approximate that of the nonlinear one -this can be computed analytically [34] or using FSP.When the system has not reached steady state, the values of σb and σu depend on time t.In this case, we can use the time evolution of the linear network with time-dependent rates to predict that of the nonlinear one -while analytical solutions are not generally available in this case, the distributions can be efficiently computed using FSP.
In some regions of parameter space, the 2-HM may still not be accurate enough.To solve this problem, we devise a second type of Holimap -the 4-parameter Holimap (4-HM), which transforms the nonlinear network into the linear one illustrated in Fig. 2(d).Here the binding rate σ b , unbinding rate σ u , and the protein burst frequencies ρ b and ρ u for the former are replaced by four effective parameters σb , σu , ρb , and ρu for the latter, which can be determined by matching the moment equations for the two networks (Methods).Note that while for the 2-HM, we matched only the zeroth and first-order moments, for the 4-HM, we match these and also the second-order moments.The 2-HM and 4-HM will be collectively referred to as Holimaps in what follows.
Thus far, we have only considered the case of h = 1.For the case of cooperative binding (h ≥ 2), the Holimap approximation procedure can be similarly performed, except that higher-order moment equations need to be solved (Supplementary Note 2) -the algorithm for finding the effective parameters requires the solution of (h + 1)-order moment equations.The computational cost of Holimap is mainly determined by the number of moment equations (L) to be solved.For autoregulatory loops, L = 1 + 2h for the LMA and L = 3 + 2h for Holimap.Note that the 2-HM and 4-HM have the same L.
The principles used to construct Holimaps for autoregulated networks can be used to obtain Holimaps for an arbitrarily complex network consisting of a system of interacting genes that regulate each other via positive or negative feedback.In this case, the computational time of Holimap depends on the complexity of the regulatory network -an increased number of nodes (genes) or edges (regulatory reactions) results in an increased number of moment equations to be solved.

Applications to one-node (autoregulatory) networks
We now assess the performance of Holimap based on the Hellinger distance (HD) between the steady-state protein distribution obtained by applying FSP to the nonlinear network and the approximate distribution computed using the LMA and the two types of Holimaps.Note that while the direct application of FSP also leads to an approximate distribution, in effect it can be considered exact since the error is very small provided the state space is truncated to a large enough value [22].Here we choose the HD because it is bounded between 0 and 1; a visually accurate approximation is obtained when the HD 0.1.For simplicity, the degradation rate of protein is taken as d = 1 throughout the analysis; equivalently, one could re-normalize time and parameters by d [32].
Fig. 2(e) illustrates the HD for the LMA as a function of ρ u and ρ b .Clearly, the LMA performs well when ρ u and ρ b are not very different from each other.However, it results in larger deviations from FSP when the protein burst frequency in one gene state is significantly larger than that in the other.We also find that the LMA is much more accurate for negative feedback loops (ρ u > ρ b ) than for positive feedback loops (ρ b > ρ u ).In the LMA, the effective stochastic parameter σb is approximated by σ b multiplied by the conditional mean of protein numbers in the unbound state.Hence it must give rise to inaccurate approximations when protein noise in the unbound gene state is large.This is exactly what happens in the positive feedback case where the low synthesis rate in the unbound state results in a small conditional mean and thus large protein noise.
We next examine whether Holimap outperforms the LMA when it is applied to positive feedback loops.In Fig. 2(f), we show the HD against σ u and σ b for the LMA, 2-HM, and 4-HM when ρ b ρ u .It is clear that the LMA (Fig. 2(f), left) performs well when σ u and σ b are both small, but it becomes highly inaccurate when σ u and σ b are larger.The protein distribution can be unimodal or bimodal.The bimodal one is of particular interest because it indicates the separation of isogenic cells into two different phenotypes.In particular, we find that the LMA results in poor approximations when σ u ≥ d and when the distribution is bimodal.This can be explained as follows.Recall that the LMA transforms a nonlinear network into a linear one with unchanged σ u , which is commonly known as the telegraph model of stochastic gene expression [35].In Ref. [36], it has been proved that the telegraph model can produce a bimodal steadystate distribution only when both gene switching rates are smaller than the degradation rate (σ u , σb < d).When σ u ≥ d, the linear network can never exhibit bimodality, while the bimodality in the nonlinear network can be apparent.
In contrast to the LMA, both the 2-HM and 4-HM markedly reduce the HD values (Fig. 2(f), center and right).The LMA has a maximum HD of 0.7, while for the two types of Holimaps, the maximum HDs are only 0.2 and 0.16.The 4-HM performs marginally better than the 2-HM in capturing steady-state protein distributions.We also compare the region of parameter space where bimodality is predicted to exist (region enclosed by the orange curves) with the actual region where bimodality manifests according to FSP (region enclosed by the red curves).We note that while the LMA fails to capture the bimodal region of the protein distribution, especially when σ u ≥ d, both the 2-HM and 4-HM capture the vast majority of the bimodal region.In summary, the deficiencies of the LMA for positive feedback loops are remedied by the use of Holimaps (Fig. 2(g)).
Finally, we examine how the cooperativity in protein binding affects the accuracy of various approximation methods.Fig. 2(h) shows the maximum HD as a function of h for the LMA, 2-HM, and 4-HM, where the maximum HD is computed when σ u and σ b vary over large ranges and other parameters remain fixed.Clearly, for the LMA, the maximum HD increases approximately linearly with respect to h when h ≤ 4; for Holimaps, the maximum HD is insensitive to h.Since transcription factor cooperativity is the norm rather than the exception [5], our results suggest Holimap's accuracy remains high over the physiologically meaningful range of parameter values.
The results that we have presented assume steady-state conditions.However, the 2-HM can also accurately reproduce the time evolution of the protein distribution for nonlinear gene networks (Supplementary Fig. 1).The 4-HM is also accurate; however depending on parameter values it may lead to numerical instability at short times and hence the 2-HM might be the preferable choice when dynamics is of major interest.In steady state, while the improvement in accuracy of the 4-HM may be marginal, nevertheless since the two types of Holimaps require the solution of the same number of moment equations, the 4-HM is more advantageous when dynamics is not of interest.

Applications to two-node networks with deterministic mono-and bistability
We next evaluate the performance of Holimaps when applied to study the steady-state behavior of two-node gene networks, where two genes regulate each other (Fig. 3(a), left).Feedback is mediated by cooperative binding of h 1 copies of protein P 1 to gene G 2 and cooperative binding of h 2 copies of protein P 2 to gene G 1 .Here σ bi and σ ui are the binding and unbinding rates for gene G i , respectively; ρ bi and ρ ui are the synthesis rates of protein P i when the gene is in the bound and unbound states, respectively; d i is the degradation rate of protein P i .For simplicity, we do not take protein bursting into account, although it can be included easily.Depending on whether ρ ui < ρ bi or ρ ui > ρ bi for i = 1, 2, there are four different types of effective system dynamics that constitute either a positive feedback or a negative feedback loop (Fig. 3(b)).For example, a toggle switch (two negative regulations) [37] corresponds to the case of ρ u1 > ρ b1 and ρ u2 > ρ b2 .For two-node networks, Holimaps can be obtained in a similar way as we have previously shown for autoregulatory loops, i.e. by replacing all protein-gene binding reactions by effective first-order reactions with new parameters and also allowing some of the other reactions to have different rate constants than those in the original network (Fig. 3(a), center and right).
We first focus on a negative feedback loop without cooperative binding (Fig. 3(c)).Since the LMA performs well when the unbinding rate σ ui is much smaller than the degradation rate d i , here we consider the case of σ ui d i .We use the HD between the actual and approximate steady-state distributions of protein P 1 to test the accuracy of Holimap.Fig. 3(d) illustrates the HDs for the LMA and 4-HM as functions of σ b1 and σ b2 .We find that the network displays bimodality when σ b1 is large and σ b2 is small.This is surprising because in the literature there are two well-accepted origins for bimodality: (i) a positive feedback loop with ultra-sensitivity (type-I) [37] and (ii) slow switching between gene states (type-II), independent of the type of feedback loop [34].Here the network is a negative feedback loop without cooperative binding, and thus there is neither a positive feedback loop nor ultra-sensitivity.Moreover, since both σ u1 and σ b1 are large, gene G 1 switches rapidly between the two states.Hence the bimodality observed is neither type-I nor type-II, and in the following we refer to it as type-III bimodality.
From Fig. 3(d), it is clear that the LMA performs poorly in this bimodal region.Again, the LMA cannot capture type-III bimodality since it transforms the nonlinear network into a linear one with unchanged σ ui , which is unable to produce a bimodal distribution when σ ui ≥ d i [36].On the other hand, the 4-HM significantly reduces the HD values and performs exceptionally well in capturing the bimodal region (Fig. 3(e)).Here we do not show the 2-HM because it leads to similar results as the 4-HM except for a slightly larger HD value.
We next consider a toggle switch with cooperative binding, where two genes repress each other (Fig. 3(f)).Note that this is a positive feedback loop with ultra-sensitivity and hence it can produce deterministic bistability (type-I bimodality), which means that the corresponding system of deterministic rate equations (Supplementary Note 3) is capable of having two stable fixed points and one unstable point.Again, we only focus on the situation of σ ui d i .Fig. 3(g) illustrates the HDs for the LMA and 4-HM against σ b1 and σ b2 .The yellow curve encloses the region of deterministic bistability, which is markedly smaller than the true bimodal region enclosed by the red curve.According to simulations, bimodality can be observed when both σ b1 and σ b2 are large.The LMA fails to reproduce the bimodal distribution since σ ui ≥ d i , as expected.The 4-HM not only successfully captures the bimodal region (enclosed by the orange curve), but also yields small HD values.The maximum HD for the LMA is as large as 0.7, while it is only 0.13 for the 4-HM.In particular, in the deterministically bistable region, both the 2-HM and 4-HM accurately predict the protein distribution while the LMA completely fails (Fig. 3(h)).Here the parameters are chosen so that the system displays deterministic bistability.Note that while we only focus on the distribution of protein P1 in (d),(e),(g), and (h), the distribution of the second protein P2 is also accurately predicted by Holimap (Supplementary Fig. 2).See Supplementary Note 1 for the technical details of this figure.

Applications to three-node networks with deterministic oscillations
We now focus on three-node networks, where three genes regulate each other in a cyclic manner (Fig. 4(a), left).Feedback is mediated by cooperative binding of h i copies of protein P i to gene G i+1 for i = 1, 2, 3, where G 4 = G 1 .Again, depending on whether ρ ui < ρ bi or ρ ui > ρ bi for i = 1, 2, 3, the network can be a repressilator (three negative regulations) [38], a Goodwin  4-HM for a three-node gene network, where three genes regulate each other along the counterclockwise direction.Feedback is mediated by cooperative binding of hi copies of protein Pi to gene Gi+1, where G4 is understood to be G1.(b) A repressilator with cooperative binding.Here the cooperativities are chosen as hi = 3 for i = 1, 2, 3 such that the deterministic system of rate equations produces sustained oscillations.(c) Time evolution of the mean and Fano factor of fluctuations in the number of molecules of protein P1 computed using the SSA (with 10 5 trajectories), LMA, and 2-HM.(d) Comparison of the time-dependent distributions of protein P1 at 15 time points computed using the SSA (with 10 5 trajectories), LMA, 2-HM, and hybrid SSA+2-HM (with 2000 trajectories).See Supplementary Note 1 for the technical details of this figure .model (one negative regulation and two positive regulations) [39], or a positive feedback loop [40].
As for previous examples, Holimap transforms the nonlinear network into a linear one (Fig. 4(a), right).We now focus on the repressilator illustrated in Fig. 4(b), where the cooperativities are chosen as h 1 = h 2 = h 3 = 3.Here high cooperativities are chosen since we require the corresponding deterministic system of rate equations (Supplementary Note 3) to produce sustained oscillations.According to simulations, deterministic oscillations are not observed when h i ≤ 2. Fig. 4(c) illustrates the oscillatory time evolution of the mean and Fano factor (the variance divided by the mean) of fluctuations in the number of protein P 1 computed using the SSA, LMA, and 2-HM.Note that here we do not consider the 4-HM because, as previously mentioned, it may cause numerical instability when computing time-dependent distributions.The LMA fails to reproduce damped oscillations in the time evolution of the mean and Fano factor, while Holimap excellently captures these oscillations.Note also that the LMA significantly underestimates the variance of fluctuations and hence leads to a much smaller Fano factor in the limit of long times.
Fig. 4(d) compares the time-dependent protein distributions computed using the SSA, LMA, and 2-HM.Interestingly, both the LMA and 2-HM accurately reproduce the protein distribution at small times (t ≤ 3).However, the LMA fails to reproduce bimodality at intermediate and large times since it underestimates noise.In contrast, Holimap performs remarkably well in predicting the complete time evolution of the protein distribution.
Thus far, we have considered regulatory networks where each gene is regulated by one TF; however, many genes are regulated by a multitude of TFs which are often shared between multiple genes [41].In Supplementary Note 4, we investigate gene networks with two TF binding sites.We show that Holimap performs excellently in capturing the protein distributions as well as the bimodal region, independent of the type of network topology and the type of TF binding (independent, positive cooperative, and negative cooperative binding [42]).
A hybrid combination of SSA and Holimap provides highly efficient computation of complex gene network dynamics The FSP and SSA are two widely used methods for solving the dynamics of stochastic chemical reaction systems.While FSP yields an accurate distribution, from a practical point of view, it is only applicable to small networks where protein numbers are not very large; for large networks, the size of the state space leads to an enormous computational cost [22].The SSA can also be computationally expensive, particularly when the network has multiple reaction time scales [23].When fluctuations are large, it can yield a non-smooth distribution, from which it is sometimes even difficult to determine the number of modes.To overcome this, a huge number of stochastic trajectories may be needed to obtain statistically accurate results.Holimap provides an accurate and smooth approximation of the protein distributions; however, it becomes computationally slow when the network is complex or the cooperativity is large since in these case we have to solve a large number of moment equations.This raises an important question: is it possible to develop a highly efficient and accurate computation method of stochastic gene network dynamics that yields a smooth distribution?
To address this question, we propose a hybrid method that combines the SSA and Holimap.This method consists of three steps (Fig. 5(a)).First we use the SSA to generate a small number of trajectories (usually a few thousand trajectories are enough) from which we compute the steady-state or timedependent sample moments of protein numbers.We then use the latter to compute the approximate effective parameters of the linear network.Finally we use FSP to compute the protein distribution of the linear network with effective parameters to approximate that of the nonlinear one.For example, for the autoregulatory circuit illustrated in Fig. 2(a), we substitute the sample moments obtained from the SSA into Eq.( 4) to compute the approximate values of σu and σb , and then use the marginal protein distribution of the linear network to construct the 2-HM of the nonlinear network.
This hybrid SSA+Holimap method is computationally much faster than the SSA because the number of trajectories needed to obtain good approximations to low-order moments is much less than that needed to smooth protein distributions.It also computationally less expensive than Holimap since there is no need to solve a large number of moment equations.To test this hybrid method, we compare the time-dependent distributions for the repressilator calculated using the SSA, LMA, 2-HM, and hybrid SSA+2-HM (Fig. 4(d)).Fig. 5(b) also compares the CPU times and the accuracy of these methods.The number of SSA trajectories N needed for SSA+2-HM is chosen such that the HD between the distributions obtained from N trajectories and those obtained from 3N trajectories is less than 0.02 (averaged over all time points) -a sample size of N = 2000 is sufficient to satisfy this criterion.Notably with almost the same CPU time, the hybrid method yields distributions that are much more accurate than those obtained from only the SSA with the same number of trajectories -the HD for the former is only 0.04 -0.06, while for the latter it is 0.11 -0.13; here the distributions obtained from the SSA with 10 5 trajectories are used as a proxy of the ground truth when computing the HDs.We also note that SSA+2-HM yields distributions that are practically as accurate as the 2-HM but is over 16 times faster (28s vs 7 min 39s).
To further test the accuracy of SSA+Holimap, we apply it to a random M -node gene network (Fig. 5(c)), where any pair of nodes has a probability of 2/M to be connected.This guarantees that each gene on average regulates two genes.When connected, each direct edge has an equal probability to be positive or negative regulation; autoregulation is also allowed.The details of the stochastic model are described in Methods.We then apply the 2-HM to transform the nonlinear random network into a linear one and then use 2000 SSA trajectories to estimate the effective parameters of the linear network.Fig. 5(d) illustrates the CPU times and HDs against the number of nodes M for SSA+2-HM and the SSA with the same number of trajectories.Again an SSA with 10 5 trajectories is used to generate a proxy of the ground truth distribution when computing the HDs.Clearly, the two methods yield almost the same CPU times that approximately linearly scale with M .This is because for SSA+2-HM, almost all time is spent on simulating the SSA trajectories, while solving the linear network consumes very little time.However, compared with an SSA with 2000 trajectories, SSA+2-HM gives rise to markedly lower HD values, which are insensitive to M .

Discussion
In this paper, we have constructed a novel computational method, Holimap, for the efficient and accurate prediction of the protein number distributions of a general gene regulatory network.We have showcased the method by applying it to a variety of networks including simple autoregulatory loops where a gene influences its own expression, two-gene systems such as the toggle switch, three-gene systems such as the repressilator, and complex randomly connected networks with numerous interacting genes.Notably, we have demonstrated that a hybrid method that uses both Holimap and the SSA leads to much more accurate distributions than solely using the SSA, with practically no increase in the CPU time and an accuracy that is independent of the number of interacting genes in the network.
Some of the advantages of our method over other common approximations in the literature are as follows: (i) Holimap does not sacrifice the discrete nature of molecular reactions since the approximate distributions are solutions of the CME of the effective linear network.This is unlike many common methods that achieve a speed increase by making use of a continuum approximation of the CME such as the Fokker-Planck / Langevin equations [43,44] or partial integro-differential equations [45,46]; (ii) Holimap does not assume the protein number distribution to be of a simple type such as the Gaussian, Poisson, Lognormal or Gamma distributions, as commonly assumed by conventional moment-closure methods [47,48] -the solution of the linear network that Holimap utilizes is very flexible and spans a very large number of possible distribution shapes including those with multiple modes and significant skewness.For example, if each gene in a complex regulatory network switches between a number of states for which only one is active, then Holimap approximates the protein distribution for each gene by that of a multi-state gene expression model with no regulatory interactions (Supplementary Note 4) for which the analytical steady-state solution is known to be a generalised hypergeometric function [49], which includes a large number of special functions as special cases.
The main limitation of the present method is that there are no analytical guarantees that the effective parameters of the linear network are positive definite for all times.Nevertheless for all examples using the 2-HM in this paper, we have numerically found this to be the case and hence we are confident that the linear network obtained by the 2-HM procedure is generally physically interpretable.In contrast, we observed that the 4-HM procedure can occasionally give rise to negative parameter values (typically when the binding and unbinding rates are large) and hence should be used more cautiously.Ongoing work aims to extend the method to predict both mRNA and protein dynamics, including their joint distribution for pairs of genes.
Concluding, we have devised a method that overcomes many of the known difficulties encountered when simulating complex stochastic gene network dynamics.We anticipate that Holimap will be useful for investigating noisy dynamical phenomena in complex regulatory networks where intuitive understanding is challenging to attain and simulations using the SSA become computationally prohibitive.
Fig.1.Illustration of Holimap and its advantages over the SSA.Holimap decouples gene-gene interactions in a nonlinear regulatory network and transforms it into a linear network with multiple effective parameters, some of which may be time-dependent.The time evolution of protein number distributions (for all genes) of the nonlinear network can be approximately predicted by solving the dynamics of the effective linear network using, e.g., FSP.Compared with the conventional Monte Carlo approach (the SSA), Holimap not only significantly reduces the CPU time, but it also yields an accurate, noise-free prediction of the protein number distributions.

Fig. 2 .
Fig. 2. Holimaps for autoregulatory gene networks in steady-state conditions.(a) Stochastic model of an autoregulatory feedback loop, which includes bursty protein synthesis, protein degradation, cooperative binding of protein to the gene, and unbinding of protein.(b) The LMA maps the nonlinear network to a linear one with effective parameter σb .The high-order reactions G + hP G * in the former are replaced by the first-order reactions G G * in the latter.(c) The 2-HM maps the nonlinear network to a linear one with effective parameters σu and σb .(d) The 4-HM maps the nonlinear network to a linear one with effective parameters σu, σb , ρu, and ρb .(e) Heat plot of the HD for the LMA as a function of the protein burst frequencies ρu and ρ b .Here the HD for the LMA represents the Hellinger distance between the real steady-state protein distribution computed using FSP applied to the nonlinear system and the approximate protein distribution computed using the LMA.(f) Heat plots of the HDs for the LMA and Holimaps as functions of the unbinding rate σu and binding rate σ b when ρ b ρu.The red curves enclose the true bimodal region, i.e. the parameter region in which the protein number has a bimodal distribution, as predicted by FSP.The orange curves enclose the bimodal region predicted by the approximation method.The vertical white dashed line demarcates the region of σu ≥ d, where the linear network given by the LMA can never exhibit bimodality, from the region of σu < d where it can.(g) Comparison of the steady-state protein distributions computed using FSP, LMA, 2-HM, and 4-HM in different regimes of gene state switching.(h) The maximum HD as a function of the cooperativity h for the LMA and Holimaps.Here the maximum HD is computed when σu and σ b vary over large ranges, while other parameters remain fixed.See Supplementary Note 1 for the technical details of this figure.

Fig. 3 .
Fig.3.Holimaps for two-node gene networks in steady-state conditions.(a) Illustration of the 2-HM and 4-HM for a two-node gene network, where two genes G1 and G2 regulate each other.Feedback is mediated by cooperative binding of h1 copies of protein P1 to gene G2 and cooperative binding of h2 copies of protein P2 to gene G1.(b) The two-node network can describe four different feedback loops, according to whether ρui > ρ bi or ρui < ρ bi for i = 1, 2. (c) A negative feedback loop with non-cooperative binding (h1 = h2 = 1).(d) Heat plots of the HDs for the LMA and 4-HM as functions of the binding rates σ b1 and σ b2 .Here the HD represents the Hellinger distance between the real and approximate steady-state distribution of the number of molecules of protein P1.The red curve encloses the true bimodal parameter region computed using FSP, and the orange curve encloses the bimodal region predicted by Holimap.(e) Comparison of the steady-state distributions of protein P1 computed using FSP, LMA, 2-HM, and 4-HM.(f) A toggle switch with cooperative binding (h1 = h2 = 2).(g) Same as (d) but for the toggle switch.The yellow curve encloses the parameter region of deterministic bistability, i.e. the region in which the deterministic rate equations have two stable fixed points and one unstable fixed point.(h) Same as (e) but for the toggle switch.Here the parameters are chosen so that the system displays deterministic bistability.Note that while we only focus on the distribution of protein P1 in (d),(e),(g), and (h), the distribution of the second protein P2 is also accurately predicted by Holimap (Supplementary Fig.2).See Supplementary Note 1 for the technical details of this figure.

Fig. 4 .
Fig.4.Holimaps for three-node gene networks.(a) Illustration of the 2-HM and 4-HM for a three-node gene network, where three genes regulate each other along the counterclockwise direction.Feedback is mediated by cooperative binding of hi copies of protein Pi to gene Gi+1, where G4 is understood to be G1.(b) A repressilator with cooperative binding.Here the cooperativities are chosen as hi = 3 for i = 1, 2, 3 such that the deterministic system of rate equations produces sustained oscillations.(c) Time evolution of the mean and Fano factor of fluctuations in the number of molecules of protein P1 computed using the SSA (with 10 5 trajectories), LMA, and 2-HM.(d) Comparison of the time-dependent distributions of protein P1 at 15 time points computed using the SSA (with 10 5 trajectories), LMA, 2-HM, and hybrid SSA+2-HM (with 2000 trajectories).See Supplementary Note 1 for the technical details of this figure.

Fig. 5 .
Fig.5.A hybrid method combining the SSA and Holimap.(a) SSA+Holimap serves as a highly efficient strategy to approximately solve the dynamics of complex stochastic gene networks.First, the SSA is used to generate a relatively small number of trajectories of the nonlinear network from which the steady-state or time-dependent sample moments are estimated.The sample moments are then used to approximately compute the effective parameters of the linear network.Finally, the protein distributions follow directly by solving the dynamics of the linear network using FSP.(b) Comparison of the CPU times and the accuracy (measured by HDs at t = 30) for different methods.All methods were used to simulate the time-dependent protein distributions shown in Fig.4(d).The HD represents the Hellinger distance between the actual and approximate protein distributions.Here a proxy for the ground truth distribution is computed using the SSA with 10 5 trajectories rather than FSP since the latter is computationally infeasible (see Supplementary Note 5 for the estimation procedure for the CPU time required by FSP).(c) An illustration of a random M -node network, where each pair of nodes has a probability of 2/M to be connected.There are on average 2M directed edges for the network, each having an equal probability to be positive or negative regulation.(d) Comparison of the CPU times and the accuracy (measured by HDs averaged over ten time points and over all proteins) against the number of nodes M for SSA+2HM and the SSA with the same number of trajectories.Here the number of trajectories needed for both SSA+2HM and SSA is chosen as N = 2000.Both methods were used to simulate the time-dependent distributions for a random network.The error bar shows the standard deviation for five different random networks.See Methods for the technical details.In (b) and (d), all simulations were performed on an Intel Core i9-9900K processor (3.60GHz).